Context: Thera Bank recently saw a steep decline in the number of users of its credit cards. Credit cards are a good source of income for banks because of the various fees they charge, such as annual fees, balance transfer fees, cash advance fees, late payment fees, and foreign transaction fees. Some fees are charged to every user irrespective of usage, while others are charged only under specified circumstances.
Objective:
Customers leaving the credit card service would lead to a loss for the bank, so the bank wants to analyze customer data to identify which customers are likely to leave and why, so that it can improve in those areas.
As data scientists at Thera Bank, we need to build a classification model that will help the bank improve its services so that customers do not renounce their credit cards.
We need to identify the best possible model that will give the required performance.
Data Dictionary
CLIENTNUM: Client number. Unique identifier for the customer holding the account
Attrition_Flag: Internal event (customer activity) variable - if the account is closed then "Attrited Customer" else "Existing Customer"
Customer_Age: Age in Years
Gender: Gender of the account holder
Dependent_count: Number of dependents
Education_Level: Educational Qualification of the account holder - Graduate, High School, Unknown, Uneducated, College(refers to a college student), Post-Graduate, Doctorate.
Marital_Status: Marital Status of the account holder
Income_Category: Annual Income Category of the account holder
Card_Category: Type of Card
Months_on_book: Period of relationship with the bank
Total_Relationship_Count: Total no. of products held by the customer
Months_Inactive_12_mon: No. of months inactive in the last 12 months
Contacts_Count_12_mon: No. of Contacts between the customer and bank in the last 12 months
Credit_Limit: Credit Limit on the Credit Card
Total_Revolving_Bal: The balance that carries over from one month to the next is the revolving balance
Avg_Open_To_Buy: Open to Buy refers to the amount left on the credit card to use (Average of last 12 months)
Total_Trans_Amt: Total Transaction Amount (Last 12 months)
Total_Trans_Ct: Total Transaction Count (Last 12 months)
Total_Ct_Chng_Q4_Q1: Ratio of the total transaction count in 4th quarter and the total transaction count in 1st quarter
Total_Amt_Chng_Q4_Q1: Ratio of the total transaction amount in 4th quarter and the total transaction amount in 1st quarter
Avg_Utilization_Ratio: Represents how much of the available credit the customer spent
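The last few fields in the dictionary are related to each other: Open to Buy is the unused portion of the credit limit, and the utilization ratio is the used portion. A minimal sketch with made-up numbers (not rows from BankChurners.csv) illustrates the relationships implied by the definitions above:

```python
import pandas as pd

# Hypothetical values illustrating the relationships implied by the data dictionary:
#   Avg_Open_To_Buy       ~ Credit_Limit - Total_Revolving_Bal
#   Avg_Utilization_Ratio ~ Total_Revolving_Bal / Credit_Limit
sample = pd.DataFrame({
    "Credit_Limit": [5000.0, 12000.0],
    "Total_Revolving_Bal": [1500.0, 0.0],
})
sample["Avg_Open_To_Buy"] = sample["Credit_Limit"] - sample["Total_Revolving_Bal"]
sample["Avg_Utilization_Ratio"] = sample["Total_Revolving_Bal"] / sample["Credit_Limit"]
print(sample)
```

In the real data these columns are 12-month averages, so the identities hold only approximately.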
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
AdaBoostClassifier,
GradientBoostingClassifier,
RandomForestClassifier,
BaggingClassifier,
)
from xgboost import XGBClassifier
!pip install lightgbm
import lightgbm as lgb
from sklearn import metrics
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.metrics import (
f1_score,
accuracy_score,
recall_score,
precision_score,
confusion_matrix,
roc_auc_score,
)
# plot_confusion_matrix and plot_roc_curve were replaced by Display classes:
from sklearn.metrics import ConfusionMatrixDisplay, RocCurveDisplay
# Usage: ConfusionMatrixDisplay.from_estimator(estimator, X_test, y_test)
#        RocCurveDisplay.from_estimator(estimator, X_test, y_test)
from sklearn.preprocessing import (
StandardScaler,
MinMaxScaler,
OneHotEncoder,
RobustScaler,
)
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.base import TransformerMixin
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
pd.set_option("display.max_columns", None)
# To suppress scientific notation in dataframes
pd.set_option("display.float_format", lambda x: "%.3f" % x)
!pip install ydata-profiling # Install the package providing pandas profiling functionality
from ydata_profiling import ProfileReport # Update the import statement to reflect the correct package and module name
Loading the Data Set
df = pd.read_csv("/content/BankChurners.csv")
df.shape
Observation: The dataset has 10127 rows and 21 feature columns (plus two Naive Bayes helper columns that are dropped below)
df.head()
df.tail()
df.describe()
additional_droppable_columns = [
'Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_1',
'Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_2'
]
for col in additional_droppable_columns:
if col in df.columns.unique().tolist():
df.drop(columns=[col], inplace=True)
data = df.copy()
data.head()
data.tail()
data.info()
Observation: It is observed that Education_Level and Marital_Status have fewer than 10127 non-null values, indicating missing data
data.duplicated().sum()
Observation: It is observed that there are no duplicated rows
print("Missing Values Count per Column:")
print(df.isnull().sum())
Observation: It is observed that, except for Education_Level and Marital_Status, all columns have zero missing values
Unique values of categorical variables
data.select_dtypes(include="object").nunique()
Unique values of numerical variables
data.select_dtypes(exclude="object").nunique()
Observation: Customer_Age has 45 unique values, so the customers span a fairly wide range of ages
data.describe()
Observation:
Below observations are noted:
- The mean of Customer_Age is approx. 46 and the median is 45, so roughly half of the customers are 45 or younger.
- Dependent_count has a mean and median of ~2.
- Months_on_book has a mean and median of 36 months; the minimum is 13 months, showing that the dataset captures customers who have been with the bank for at least one whole year.
- Total_Relationship_Count has a mean and median of ~4.
- Credit_Limit has a wide range of 1.4K to 34.5K, with the median (4.5K) well below the mean (8.6K), indicating right skew.
- Total_Trans_Ct has a mean of ~65 and a median of 67.
data.describe(include='object')
def category_unique_value():
for cat_cols in (
data.select_dtypes(exclude=[np.int64, np.float64]).columns.unique().to_list()
):
print("Unique values and corresponding data counts for feature: " + cat_cols)
print("-" * 90)
df_temp = pd.concat(
[
data[cat_cols].value_counts(),
data[cat_cols].value_counts(normalize=True) * 100,
],
axis=1,
)
df_temp.columns = ["Count", "Percentage"]
print(df_temp)
print("-" * 90)
category_unique_value()
Observation:
It is observed that 93% of customers hold the Blue card, 5% Silver, 1% Gold, and less than 1% Platinum
Pre-EDA Data Processing
data.drop(columns=["CLIENTNUM"],inplace=True)
# Check the actual column names in your DataFrame
print(data.columns)
# Guard against casing or whitespace differences in the column name
marital_status_col = next((col for col in data.columns if col.lower().strip() == 'marital_status'), None)
if marital_status_col:
    # Fill missing values in the existing column instead of creating a new, differently-cased one
    data[marital_status_col] = data[marital_status_col].fillna('Unknown')
else:
    print("Warning: 'Marital_Status' column not found in the DataFrame.")
data.loc[data[data["Income_Category"] == "abc"].index, "Income_Category"] = "Unknown"
category_unique_value()
df_null_summary = pd.concat(
[data.isnull().sum(), data.isnull().sum() * 100 / data.isnull().count()], axis=1
)
df_null_summary.columns = ["Null Record Count", "Percentage of Null Records"]
df_null_summary[df_null_summary["Null Record Count"] > 0].sort_values(
by="Percentage of Null Records", ascending=False
).style.background_gradient(cmap="YlOrRd")
category_columns = data.select_dtypes(include="object").columns.tolist()
data[category_columns] = data[category_columns].astype("category")
data.columns = [i.replace(" ", "_").lower() for i in data.columns]
data.info()
# The two Naive Bayes helper columns were already dropped earlier; match them by
# name instead of position so we don't accidentally drop real features here.
nb_cols = [c for c in data.columns if c.startswith("naive_bayes")]
if nb_cols:
    data.drop(columns=nb_cols, inplace=True)
data.head(2)
Exploratory Data Analysis
Univariate Analysis
summary(data, "customer_age")
Observation: It is observed that the data is fairly uniformly distributed
summary(data, "dependent_count")
Observation: It is observed that most customers have 2 or 3 dependents
def summary(data: pd.DataFrame, x: str):
"""
The function prints the 5 point summary and histogram, box plot,
violin plot, and cumulative density distribution plots for each
feature name passed as the argument.
Parameters:
----------
x: str, feature name
Usage:
------------
summary('age')
"""
x_min = data[x].min()
x_max = data[x].max()
Q1 = data[x].quantile(0.25)
Q2 = data[x].quantile(0.50)
Q3 = data[x].quantile(0.75)
stats = {"Min": x_min, "Q1": Q1, "Q2": Q2, "Q3": Q3, "Max": x_max}  # avoid shadowing the dict builtin
df = pd.DataFrame(data=stats, index=["Value"])
print(f"5 Point Summary of {x.capitalize()} Attribute:\n")
# Assuming 'tabulate' is available, if not, install it using: !pip install tabulate
from tabulate import tabulate
print(tabulate(df, headers="keys", tablefmt="psql"))
fig = plt.figure(figsize=(16, 8))
plt.subplots_adjust(hspace=0.6)
sns.set_palette("Pastel1")
# Corrected indentation here:
plt.subplot(221, frameon=True)
ax1 = sns.histplot(data[x], color="purple", kde=True)  # distplot is deprecated; histplot is the modern equivalent
ax1.axvline(
np.mean(data[x]), color="purple", linestyle="--"
) # Add mean to the histogram
ax1.axvline(
np.median(data[x]), color="black", linestyle="-"
) # Add median to the histogram
plt.title(f"{x.capitalize()} Density Distribution")
plt.subplot(222, frameon=True)
ax2 = sns.violinplot(x=data[x], palette="Accent", split=True)
plt.title(f"{x.capitalize()} Violinplot")
plt.subplot(223, frameon=True, sharex=ax1)
ax3 = sns.boxplot(
x=data[x], palette="cool", width=0.7, linewidth=0.6, showmeans=True
)
plt.title(f"{x.capitalize()} Boxplot")
plt.subplot(224, frameon=True, sharex=ax2)
ax4 = sns.kdeplot(data[x], cumulative=True)
plt.title(f"{x.capitalize()} Cumulative Density Distribution")
plt.show()
data['gender'].value_counts()
sns.countplot(data=df, x='Gender')
plt.pie(df['Gender'].value_counts(), labels = ['Female', 'Male'], autopct='%1.1f%%', shadow = True, startangle = 90)
plt.title('Proportion of Gender count', fontsize = 16)
plt.show()
Observation:
It is observed that the gender distribution is roughly balanced
summary(data, "total_relationship_count")
Observation:
It is observed that customers have 4 or more relations with the bank
import matplotlib.pyplot as plt # Import the matplotlib.pyplot module
plt.pie(data['attrition_flag'].value_counts(), labels = ['Existing Customer', 'Attrited Customer'],
autopct='%1.1f%%', startangle = 90)
plt.title('Proportion of Existing and Attrited Customer count', fontsize = 16)
plt.show()
edu = data['education_level'].value_counts().to_frame('Counts')
plt.figure(figsize = (8,8))
# Use edu.index for x-axis and edu['Counts'] for y-axis
plt.plot(edu.index, edu['Counts'], marker='o') # Added marker for better visualization
plt.title('Proportion of Education Levels', fontsize = 18)
plt.xlabel('Education Level') # Added x-axis label
plt.ylabel('Counts') # Added y-axis label
plt.xticks(rotation=45, ha='right') # Rotate x-axis labels for readability
plt.show()
plt.figure(figsize=(10,6))
sns.countplot(x='Attrition_Flag', hue='Marital_Status', data=df)
plt.title('Attrited and Existing Customers by Marital Status', fontsize=20)
Observation: It is observed that married customers form the largest group in the customer list
def summary(data: pd.DataFrame, x: str):
"""
The function prints the 5 point summary and histogram, box plot,
violin plot, and cumulative density distribution plots for each
feature name passed as the argument.
Parameters:
----------
x: str, feature name
Usage:
------------
summary('age')
"""
# Convert the column name to lowercase to handle case sensitivity
x = x.lower()
# Check if the column exists in the DataFrame
if x not in data.columns:
print(f"Error: Column '{x}' not found in the DataFrame.")
return # Exit the function if the column is not found
x_min = data[x].min()
x_max = data[x].max()
Q1 = data[x].quantile(0.25)
Q2 = data[x].quantile(0.50)
Q3 = data[x].quantile(0.75)
stats = {"Min": x_min, "Q1": Q1, "Q2": Q2, "Q3": Q3, "Max": x_max}  # avoid shadowing the dict builtin
df = pd.DataFrame(data=stats, index=["Value"])
print(f"5 Point Summary of {x.capitalize()} Attribute:\n")
# Assuming 'tabulate' is available, if not, install it using: !pip install tabulate
from tabulate import tabulate
print(tabulate(df, headers="keys", tablefmt="psql"))
fig = plt.figure(figsize=(16, 8))
plt.subplots_adjust(hspace=0.6)
sns.set_palette("Pastel1")
# Corrected indentation here:
plt.subplot(221, frameon=True)
ax1 = sns.histplot(data[x], color="purple", kde=True)  # distplot is deprecated; histplot is the modern equivalent
ax1.axvline(
np.mean(data[x]), color="purple", linestyle="--"
) # Add mean to the histogram
ax1.axvline(
np.median(data[x]), color="black", linestyle="-"
) # Add median to the histogram
plt.title(f"{x.capitalize()} Density Distribution")
plt.subplot(222, frameon=True)
ax2 = sns.violinplot(x=data[x], palette="Accent", split=True)
plt.title(f"{x.capitalize()} Violinplot")
plt.subplot(223, frameon=True, sharex=ax1)
ax3 = sns.boxplot(
x=data[x], palette="cool", width=0.7, linewidth=0.6, showmeans=True
)
plt.title(f"{x.capitalize()} Boxplot")
plt.subplot(224, frameon=True, sharex=ax2)
ax4 = sns.kdeplot(data[x], cumulative=True)
plt.title(f"{x.capitalize()} Cumulative Density Distribution")
plt.show()
summary(data, "Credit_Limit")
Observation:
It is observed that there are high-end outliers in Credit_Limit; these may correspond to high-income customers.
data[data["credit_limit"] > 23000]["income_category"].value_counts(normalize=True)
data[data["credit_limit"] > 23000]["card_category"].value_counts(normalize=True)
Observation:
It is observed that 83% have gold card
summary(data, "total_revolving_bal")
Observation:
It is observed that the data is right-skewed
summary(data, "total_amt_chng_q4_q1")
Observation:
It is observed that outliers are present on both sides of the distribution
summary(data, "total_trans_amt")
summary(data, "total_trans_ct")
def perc_on_bar(data: pd.DataFrame, cat_columns, target, hue=None, perc=True):
'''
The function takes a category column as input and plots bar chart with percentages on top of each bar
Usage:
------
perc_on_bar(df, ['age'], 'prodtaken')
'''
subplot_cols = 2
subplot_rows = int(len(cat_columns)/2 + 1)
plt.figure(figsize=(16,3*subplot_rows))
for i, col in enumerate(cat_columns):
plt.subplot(subplot_rows,subplot_cols,i+1)
order = data[col].value_counts(ascending=False).index # Data order
ax=sns.countplot(data=data, x=col, palette = 'crest', order=order, hue=hue);
for p in ax.patches:
percentage = '{:.1f}%\n({})'.format(100 * p.get_height()/len(data[target]), p.get_height())
# Added percentage and actual value
x = p.get_x() + p.get_width() / 2
y = p.get_y() + p.get_height() + 40
if perc:
plt.annotate(percentage, (x, y), ha='center', color='black', fontsize='medium'); # Annotation on top of bars
plt.xticks(color='black', fontsize='medium', rotation= (-90 if col=='region' else 0));
plt.tight_layout()
plt.title(col.capitalize() + ' Percentage Bar Charts\n\n') # Moved out of inner loop
category_columns = data.select_dtypes(include="category").columns.tolist()
target_variable = "attrition_flag"
perc_on_bar(data, category_columns, target_variable)
Observation:
It is observed the below points
1. High imbalance in the data.
2. Data is almost equally distributed between Males and Females.
3. 31% of customers are Graduates.
4. 85% of customers are either Single or Married, with 46.7% Married.
5. 35% of customers earn less than $40K and 36% earn $60K or more.
6. 93% of customers hold the Blue card.
Bi-variate Analysis
def box_by_target(data: pd.DataFrame, numeric_columns, target, include_outliers):
"""
The function takes numeric columns, a target column, and a flag for including outliers,
and plots a box plot of each numeric column split by the target.
Usage:
------
box_by_target(df, ['age'], 'prodtaken', True)
"""
subplot_cols = 2
subplot_rows = int(len(numeric_columns) / 2 + 1)
plt.figure(figsize=(16, 3 * subplot_rows))
for i, col in enumerate(numeric_columns):
plt.subplot(subplot_rows, subplot_cols, i + 1)
sns.boxplot(
data=data,
x=target,
y=col,
orient="vertical",
palette="Blues",
showfliers=include_outliers,
)
plt.tight_layout()
plt.title(str(i + 1) + ": " + target + " vs. " + col, color="black")
numeric_columns = data.select_dtypes(exclude="category").columns.tolist()
target_variable = "attrition_flag"
box_by_target(data, numeric_columns, target_variable, True)
Observation:
The above graphs include outliers
box_by_target(data, numeric_columns, target_variable, False)
Observation:
The above graphs exclude outliers
def cat_view(df: pd.DataFrame, x, target):
"""
Function to create a Bar chart and a Pie chart for categorical variables.
"""
from matplotlib import cm
color1 = cm.inferno(np.linspace(0.4, 0.8, 30))
color2 = cm.viridis(np.linspace(0.4, 0.8, 30))
sns.set_palette("cubehelix")
fig, ax = plt.subplots(1, 2, figsize=(16, 4))
"""
Draw a Pie Chart on first subplot.
"""
s = data.groupby(x).size()
mydata_values = s.values.tolist()
mydata_index = s.index.tolist()
def func(pct, allvals):
absolute = int(pct / 100.0 * np.sum(allvals))
return "{:.1f}%\n({:d})".format(pct, absolute)
wedges, texts, autotexts = ax[0].pie(
mydata_values,
autopct=lambda pct: func(pct, mydata_values),
textprops=dict(color="w"),
)
ax[0].legend(
wedges,
mydata_index,
title=x.capitalize(),
loc="center left",
bbox_to_anchor=(1, 0, 0.5, 1),
)
plt.setp(autotexts, size=12)
ax[0].set_title(f"{x.capitalize()} Pie Chart")
"""
Draw a Bar Graph on second subplot.
"""
df = pd.pivot_table(
data, index=[x], columns=[target], values=["credit_limit"], aggfunc=len
)
labels = df.index.tolist()
no = df.values[:, 1].tolist()
yes = df.values[:, 0].tolist()
l = np.arange(len(labels)) # the label locations
width = 0.35 # the width of the bars
rects1 = ax[1].bar(
l - width / 2, no, width, label="Existing Customer", color=color1
)
rects2 = ax[1].bar(
l + width / 2, yes, width, label="Attrited Customer", color=color2
)
# Add some text for labels, title and custom x-axis tick labels, etc.
ax[1].set_ylabel("Scores")
ax[1].set_title(f"{x.capitalize()} Bar Graph")
ax[1].set_xticks(l)
ax[1].set_xticklabels(labels)
ax[1].legend()
def autolabel(rects):
"""Attach a text label above each bar in *rects*, displaying its height."""
for rect in rects:
height = rect.get_height()
ax[1].annotate(
"{}".format(height),
xy=(rect.get_x() + rect.get_width() / 2, height),
xytext=(0, 3), # 3 points vertical offset
textcoords="offset points",
fontsize="medium",
ha="center",
va="bottom",
)
autolabel(rects1)
autolabel(rects2)
fig.tight_layout()
plt.show()
"""
Draw a Stacked Bar Graph on bottom.
"""
sns.set(palette="tab10")
tab = pd.crosstab(data[x], data[target], normalize="index")
tab.plot.bar(stacked=True, figsize=(16, 3))
plt.title(x.capitalize() + " Stacked Bar Plot")
plt.legend(loc="upper right", bbox_to_anchor=(0, 1))
plt.show()
cat_view(data, "gender", "attrition_flag")
Observation:
It is observed that attrition and gender are not related to each other
cat_view(data, "education_level", "attrition_flag")
Observation: It is observed that education level and attrition are not related to each other
cat_view(data, "income_category", "attrition_flag")
cat_view(data, "card_category", "attrition_flag")
f, ax = plt.subplots(figsize=(12, 8))
# Include only numerical features for correlation calculation
numerical_data = data.select_dtypes(include=np.number)
sns.heatmap(numerical_data.corr(), annot=True, cmap="Blues")
plt.show()
Observation:
The above heatmap shows the correlation coefficients between the numerical features
def feature_name_standardize(df: pd.DataFrame):
df_ = df.copy()
df_.columns = [i.replace(" ", "_").lower() for i in df_.columns]
return df_
# Building a function to drop features
def drop_feature(df: pd.DataFrame, features: list = []):
df_ = df.copy()
if len(features) != 0:
df_ = df_.drop(columns=features)
return df_
def mask_value(df: pd.DataFrame, feature: str = None, value_to_mask: str = None, masked_value: str = None):
    df_ = df.copy()  # work on a copy and use a consistent name throughout
    if feature is not None and value_to_mask is not None:
        if feature in df_.columns:
            df_[feature] = df_[feature].astype('object')
            df_.loc[df_[df_[feature] == value_to_mask].index, feature] = masked_value
            df_[feature] = df_[feature].astype('category')
    return df_
# Building a custom imputer
def impute_category_unknown(df: pd.DataFrame, fill_value: str):
    df_ = df.copy()
    for col in df_.select_dtypes(include='category').columns.tolist():
        df_[col] = df_[col].astype('object')
        df_[col] = df_[col].fillna(fill_value)  # use the fill_value argument instead of a hard-coded string
        df_[col] = df_[col].astype('category')
    return df_
df = data.copy()
df.describe(include="all").T
columns_to_drop = [
"clientnum",
"credit_limit",
"dependent_count",
"months_on_book",
"avg_open_to_buy",
"customer_age",
"marital_status",
]
column_to_mask_value = "income_category"
value_to_mask = "abc"
masked_value = "Unknown"
# Random state and loss
seed = 1
loss_func = "logloss"
# Test and Validation sizes
test_size = 0.2
val_size = 0.25
# Dependent variable value map
target_mapper = {"Attrited Customer": 1, "Existing Customer": 0}
cat_columns = df.select_dtypes(include="object").columns.tolist()
df[cat_columns] = df[cat_columns].astype("category")
X = data.drop(columns=["attrition_flag"]) # Changed "Attrition_Flag" to "attrition_flag"
y = data["attrition_flag"].map(target_mapper) # Changed "Attrition_Flag" to "attrition_flag"
X_temp, X_test, y_temp, y_test = train_test_split(
X, y, test_size=test_size, random_state=seed, stratify=y
)
# then we split the temporary set into train and validation
X_train, X_val, y_train, y_val = train_test_split(
X_temp, y_temp, test_size=val_size, random_state=seed, stratify=y_temp
)
print(
"Training data shape: \n\n",
X_train.shape,
"\n\nValidation Data Shape: \n\n",
X_val.shape,
"\n\nTesting Data Shape: \n\n",
X_test.shape,
)
print("Training: \n", y_train.value_counts(normalize=True))
print("\n\nValidation: \n", y_val.value_counts(normalize=True))
print("\n\nTest: \n", y_test.value_counts(normalize=True))
Data processing
from sklearn.base import BaseEstimator, TransformerMixin
class FeatureNamesStandardizer(BaseEstimator, TransformerMixin):
"""
A transformer to standardize feature names:
- Replaces spaces with underscores.
- Converts to lowercase.
"""
def __init__(self):
pass
def fit(self, X, y=None):
return self
def transform(self, X):
if X is not None: # Check if X is not None
X_ = X.copy()
X_.columns = [i.replace(" ", "_").lower() for i in X_.columns]
return X_ # Return the modified DataFrame
else:
return None # Or raise an exception if None is unexpected # Handle the case where X is None
from sklearn.base import BaseEstimator, TransformerMixin
class ColumnDropper(BaseEstimator, TransformerMixin):
def __init__(self, features):
self.features = features
def fit(self, X, y=None):
return self
def transform(self, X):
return X.drop(columns=self.features, errors='ignore')  # errors='ignore' skips columns that are already absent
# To Standardize feature names
feature_name_standardizer = FeatureNamesStandardizer()
# Proceed with your data processing steps
X_train = feature_name_standardizer.fit_transform(X_train)
X_val = feature_name_standardizer.transform(X_val)
X_test = feature_name_standardizer.transform(X_test)
# To Drop unnecessary columns
column_dropper = ColumnDropper(features=columns_to_drop)
X_train = column_dropper.fit_transform(X_train)
X_val = column_dropper.transform(X_val)
X_test = column_dropper.transform(X_test)
from sklearn.base import BaseEstimator, TransformerMixin
class CustomValueMasker(BaseEstimator, TransformerMixin):
def __init__(self, feature, value_to_mask, masked_value):
self.feature = feature
self.value_to_mask = value_to_mask
self.masked_value = masked_value
def fit(self, X, y=None):
return self
def transform(self, X):
X_ = X.copy()
# Check if the feature exists in the DataFrame before masking
if self.feature in X_.columns:
X_[self.feature] = X_[self.feature].replace(self.value_to_mask, self.masked_value)
return X_
from sklearn.base import BaseEstimator, TransformerMixin
class FillUnknown(BaseEstimator, TransformerMixin):
def __init__(self):
pass
def fit(self, X, y=None):
return self
def transform(self, X):
X_ = X.copy()
for col in X_.columns:
# Access dtype of the Series using .dtypes
if X_[col].dtypes.name == 'category':
X_[col] = X_[col].cat.add_categories('Unknown')
X_[col] = X_[col].fillna('Unknown')
return X_
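The custom transformers above follow sklearn's fit/transform contract, so they can be chained with Pipeline (imported earlier). A self-contained sketch of the pattern, using two minimal stand-in transformers (LowercaseColumns and DropColumns are illustrative names, not classes from this notebook):

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline

class LowercaseColumns(BaseEstimator, TransformerMixin):
    """Minimal stand-in for FeatureNamesStandardizer."""
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        X_ = X.copy()
        X_.columns = [c.replace(" ", "_").lower() for c in X_.columns]
        return X_

class DropColumns(BaseEstimator, TransformerMixin):
    """Minimal stand-in for ColumnDropper."""
    def __init__(self, features):
        self.features = features
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X.drop(columns=self.features, errors="ignore")

pipe = Pipeline([
    ("names", LowercaseColumns()),
    ("drop", DropColumns(features=["clientnum"])),
])
demo = pd.DataFrame({"CLIENTNUM": [1, 2], "Customer Age": [45, 50]})
out = pipe.fit_transform(demo)
print(out.columns.tolist())  # ['customer_age']
```

Chaining the real transformers the same way would replace the repeated fit_transform/transform calls above with a single pipeline applied to each split.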
robust_scaler = RobustScaler(with_centering=False, with_scaling=True)
num_columns = [
"total_relationship_count",
"months_inactive_12_mon",
"contacts_count_12_mon",
"total_revolving_bal",
"total_amt_chng_q4_q1",
"total_trans_amt",
"total_trans_ct",
"total_ct_chng_q4_q1",
]
X_train[num_columns] = pd.DataFrame(
robust_scaler.fit_transform(X_train[num_columns]),
columns=num_columns,
index=X_train.index,
)
X_val[num_columns] = pd.DataFrame(
robust_scaler.transform(X_val[num_columns]), columns=num_columns, index=X_val.index
)
X_test[num_columns] = pd.DataFrame(
robust_scaler.transform(X_test[num_columns]),
columns=num_columns,
index=X_test.index,
)
print(X_train.columns)
print(X_val.columns)
print(X_test.columns)
X_train.head(3)
X_val.head(3)
X_test.head(3)
print(
"Training data shape: \n\n",
X_train.shape,
"\n\nValidation Data Shape: \n\n",
X_val.shape,
"\n\nTesting Data Shape: \n\n",
X_test.shape,
)
Model Building Considerations
Model evaluation criterion:
The model can make two kinds of wrong predictions:
1. Predicting a customer will attrite when the customer does not attrite - loss of resources spent on retention efforts.
2. Predicting a customer will not attrite when the customer does attrite - loss of the opportunity to retain the customer.
Which case is more important? The second: if an attriting customer had been predicted correctly, the marketing/sales team could have contacted the customer to retain them. These false negatives translate directly into losses for the bank, so they should be minimized.
How do we reduce false negatives? By maximizing Recall - the greater the Recall, the lower the chance of false negatives.
Let's start by building different models using KFold and cross_val_score, and tune the best model using RandomizedSearchCV.
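A tiny worked example of why false negatives drive the metric choice, using confusion_matrix and recall_score (imported above) on made-up labels:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical labels: 1 = attrited, 0 = existing
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 0, 0, 0, 0, 0, 0])  # two attriters missed

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("False negatives:", fn)                    # missed attriters = lost customers
print("Recall:", recall_score(y_true, y_pred))   # tp / (tp + fn)
```

Here two of four attriters are missed, so recall is 0.5; driving recall up directly drives the false-negative count down.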
Stratified K-Folds cross-validation provides train/validation indices to split the data. It splits the dataset into k consecutive folds (without shuffling by default), keeping the class distribution in each fold the same as in the target variable. Each fold is then used once as validation while the remaining k - 1 folds form the training set.
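A minimal sketch of this scheme, using a synthetic imbalanced dataset as a stand-in for the training split and recall as the scoring metric (per the evaluation criterion above):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Toy data standing in for (X_train, y_train)
X_demo, y_demo = make_classification(
    n_samples=500, weights=[0.84, 0.16], random_state=1
)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_demo, y_demo, cv=skf, scoring="recall"
)
print("Recall per fold:", scores.round(3))
print("Mean recall:    ", scores.mean().round(3))
```

The same pattern applies to each candidate classifier imported at the top; only the estimator passed to cross_val_score changes.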
def get_metrics_score(
    model, train, test, train_y, test_y, threshold=0.5, flag=False, roc=True
):
    """
    Function to calculate different metric scores of the model -
    Accuracy, Recall, Precision, F1, and (optionally) ROC AUC.
    model: classifier used to predict values of X
    train, test: independent features
    train_y, test_y: dependent variable
    threshold: probability threshold for classifying an observation as 1
    flag: if set to True, the computed scores are printed. Default is False.
    roc: if set to True, ROC AUC scores are included. Default is True.
    """
    score_list = []

    # Convert predicted probabilities to class labels at the given threshold
    pred_train = (model.predict_proba(train)[:, 1] > threshold).astype(int)
    pred_test = (model.predict_proba(test)[:, 1] > threshold).astype(int)

    train_acc = accuracy_score(train_y, pred_train)
    test_acc = accuracy_score(test_y, pred_test)
    train_recall = recall_score(train_y, pred_train)
    test_recall = recall_score(test_y, pred_test)
    train_precision = precision_score(train_y, pred_train)
    test_precision = precision_score(test_y, pred_test)
    train_f1 = f1_score(train_y, pred_train)
    test_f1 = f1_score(test_y, pred_test)

    score_list.extend(
        [
            train_acc, test_acc,
            train_recall, test_recall,
            train_precision, test_precision,
            train_f1, test_f1,
        ]
    )

    if roc:
        # ROC AUC is computed from the predicted probabilities, not the labels
        score_list.append(roc_auc_score(train_y, model.predict_proba(train)[:, 1]))
        score_list.append(roc_auc_score(test_y, model.predict_proba(test)[:, 1]))

    if flag:
        print("Accuracy on training set:", train_acc)
        print("Accuracy on test set:", test_acc)
        print("Recall on training set:", train_recall)
        print("Recall on test set:", test_recall)
        print("Precision on training set:", train_precision)
        print("Precision on test set:", test_precision)

    return score_list
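A hypothetical usage sketch of such a threshold-based scorer (synthetic data and made-up names X_tr, X_te; not the notebook's variables), showing why the threshold parameter matters: lowering it can only raise recall, at the cost of precision.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import recall_score

# Imbalanced synthetic data, roughly 15% positives
X, y = make_classification(n_samples=1000, weights=[0.85, 0.15], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

proba = clf.predict_proba(X_te)[:, 1]
recall_default = recall_score(y_te, proba > 0.5)
recall_low = recall_score(y_te, proba > 0.2)
# Lowering the threshold marks more customers as likely attriters,
# so the set of predicted positives grows and recall cannot decrease.
print(recall_default, recall_low)
```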
Function for Confusion Matrix
def make_confusion_matrix(model, test_X, y_actual, labels=[1, 0]):
"""
model : classifier to predict values of X
test_X: test set
y_actual : ground truth
"""
y_predict = model.predict(test_X)
cm = metrics.confusion_matrix(y_actual, y_predict, labels=labels)
df_cm = pd.DataFrame(
cm,
index=[i for i in ["Actual - Attrited", "Actual - Existing"]],
columns=[i for i in ["Predicted - Attrited", "Predicted - Existing"]],
)
group_counts = ["{0:0.0f}".format(value) for value in cm.flatten()]
group_percentages = ["{0:.2%}".format(value) for value in cm.flatten() / np.sum(cm)]
labels = [f"{v1}\n{v2}" for v1, v2 in zip(group_counts, group_percentages)]
labels = np.asarray(labels).reshape(2, 2)
plt.figure(figsize=(5, 3))
sns.heatmap(df_cm, annot=labels, fmt="", cmap="Blues").set(title="Confusion Matrix")
Add scores to score list
model_names = []
acc_train = []
acc_test = []
recall_train = []
recall_test = []
precision_train = []
precision_test = []
f1_train = []
f1_test = []
roc_auc_train = []
roc_auc_test = []
cross_val_train = []
def add_score_model(model_name, score, cv_res):
"""Add scores to list so that we can compare all models score together"""
model_names.append(model_name)
acc_train.append(score[0])
acc_test.append(score[1])
recall_train.append(score[2])
recall_test.append(score[3])
precision_train.append(score[4])
precision_test.append(score[5])
f1_train.append(score[6])
f1_test.append(score[7])
roc_auc_train.append(score[8])
roc_auc_test.append(score[9])
cross_val_train.append(cv_res)
Building Models
We are building seven models here: Bagging, Random Forest, Gradient Boosting (GBM), AdaBoost, XGBoost, Decision Tree, and Light GBM.
models = [] # Empty list to store all the models
cv_results = []
# Appending models into the list
models.append(("Bagging", BaggingClassifier(random_state=seed)))
models.append(("Random forest", RandomForestClassifier(random_state=seed)))
models.append(("GBM", GradientBoostingClassifier(random_state=seed)))
models.append(("Adaboost", AdaBoostClassifier(random_state=seed)))
models.append(("Xgboost", XGBClassifier(random_state=seed, eval_metric=loss_func)))
models.append(("dtree", DecisionTreeClassifier(random_state=seed)))
models.append(("Light GBM", lgb.LGBMClassifier(random_state=seed)))
# Inspect data types and preview the categorical columns before encoding
print(X_train.dtypes)
print(X_train.select_dtypes(include=["object", "category"]).head())
print(X_train.shape, X_val.shape)
import pandas as pd
X_train_encoded = pd.get_dummies(X_train, drop_first=True)
X_val_encoded = pd.get_dummies(X_val, drop_first=True)
# Ensure both have same columns after encoding
X_train_encoded, X_val_encoded = X_train_encoded.align(X_val_encoded, join='left', axis=1, fill_value=0)
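A small standalone example (toy frames, not the notebook's data) of why the .align(...) call matters: a category present only in the training split would otherwise produce a dummy column that is missing from the validation frame.

```python
import pandas as pd

tr = pd.DataFrame({"Card_Category": ["Blue", "Gold", "Blue"]})
va = pd.DataFrame({"Card_Category": ["Blue", "Blue"]})  # "Gold" never appears

tr_d = pd.get_dummies(tr, drop_first=True)   # has Card_Category_Gold
va_d = pd.get_dummies(va, drop_first=True)   # has no dummy columns at all

# Align validation columns to the training columns, filling missing ones with 0
tr_d, va_d = tr_d.align(va_d, join="left", axis=1, fill_value=0)
print(list(tr_d.columns) == list(va_d.columns))  # True: columns now match
```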
from sklearn.preprocessing import LabelEncoder

# Reset indices so train/validation rows stay aligned after earlier splits
X_train = X_train.reset_index(drop=True)
X_val = X_val.reset_index(drop=True)

# Label-encode every categorical column, keeping the fitted encoders so the
# same mapping can be applied consistently to the validation data
label_encoders = {}
for col in X_train.select_dtypes(include=["object", "category"]).columns:
    le = LabelEncoder()
    X_train[col] = le.fit_transform(X_train[col])
    X_val[col] = le.transform(X_val[col])
    label_encoders[col] = le
for name, model in models:
print(f"Processing model: {name}") # Debugging line
scoring = "recall"
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
try:
cv_result = cross_val_score(model, X_train, y_train, scoring=scoring, cv=kfold)
cv_results.append(cv_result)
model.fit(X_train, y_train)
model_score = get_metrics_score(model, X_train, X_val, y_train, y_val)
add_score_model(name, model_score, cv_result.mean())
except Exception as e:
print(f"⚠️ Error in model {name}: {e}") # Catch any errors
Comparing Models
comparison_frame = pd.DataFrame(
{
"Model": model_names,
"Cross_Val_Score_Train": cross_val_train,
"Train_Accuracy": acc_train,
"Test_Accuracy": acc_test,
"Train_Recall": recall_train,
"Test_Recall": recall_test,
"Train_Precision": precision_train,
"Test_Precision": precision_test,
"Train_F1": f1_train,
"Test_F1": f1_test,
"Train_ROC_AUC": roc_auc_train,
"Test_ROC_AUC": roc_auc_test,
}
)
# Sorting models in decreasing order of cross-validation score and test recall
comparison_frame.sort_values(
by=["Cross_Val_Score_Train", "Test_Recall"], ascending=False
).style.highlight_max(color="lightgreen", axis=0).highlight_min(color="pink", axis=0)
Observation:
The best model with respect to cross-validation score and test recall is Light GBM. The next best models are XGBoost, GBM, and AdaBoost, respectively.
# Plotting boxplots of the CV recall scores for all models defined above
fig = plt.figure(figsize=(10, 7))
fig.suptitle("Algorithm Comparison")
ax = fig.add_subplot(111)
plt.boxplot(cv_results)
ax.set_xticklabels(model_names, rotation=45, ha="right")
plt.ylabel("Cross-Validation Recall")
plt.show()
Observation:
Light GBM, XGBoost, and GBM appear to be the models with the best potential. AdaBoost also looks promising, given its high-end outlier score.
Oversampling train data using SMOTE
Our dataset has a large imbalance in the target variable labels. To deal with such datasets, we use a set of techniques collectively known as imbalanced classification.
One approach to addressing imbalanced datasets is to oversample the minority class. The simplest approach involves duplicating examples in the minority class, although these examples don’t add any new information to the model. Instead, new examples can be synthesized from the existing examples. This is a type of data augmentation for the minority class and is referred to as the Synthetic Minority Oversampling Technique, or SMOTE for short.
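The interpolation at the heart of SMOTE can be sketched in a few lines of NumPy (a toy illustration with made-up points, not the library's implementation):

```python
import numpy as np

# SMOTE's core idea: a synthetic point is a random interpolation between a
# minority-class sample and one of its k nearest minority-class neighbours.
rng = np.random.default_rng(1)
x_i = np.array([2.0, 3.0])    # a minority-class sample
x_nn = np.array([4.0, 5.0])   # one of its nearest minority neighbours
lam = rng.random()            # lambda drawn uniformly from [0, 1)
x_new = x_i + lam * (x_nn - x_i)
print(x_new)  # lies on the line segment between x_i and x_nn
```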
print("Before UpSampling, counts of label 'Yes': {}".format(sum(y_train == 1)))
print("Before UpSampling, counts of label 'No': {} \n".format(sum(y_train == 0)))
sm = SMOTE(
sampling_strategy="minority", k_neighbors=10, random_state=seed
) # Synthetic Minority Over Sampling Technique
X_train_over, y_train_over = sm.fit_resample(X_train, y_train)
print("After UpSampling, counts of label 'Yes': {}".format(sum(y_train_over == 1)))
print("After UpSampling, counts of label 'No': {} \n".format(sum(y_train_over == 0)))
print("After UpSampling, the shape of train_X: {}".format(X_train_over.shape))
print("After UpSampling, the shape of train_y: {} \n".format(y_train_over.shape))
for name, model in models:
print(f"Processing model: {name}") # Debugging line
scoring = "recall"
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
try:
cv_result_over = cross_val_score(
estimator=model, X=X_train_over, y=y_train_over, scoring=scoring, cv=kfold
) # No extra spaces before this line
cv_results.append(cv_result_over) # Append the result
model.fit(X_train_over, y_train_over) # Fit the model
model_score_over = get_metrics_score(
model, X_train_over, X_val, y_train_over, y_val
) # Ensure alignment
add_score_model(name + " OverSampling", model_score_over, cv_result_over.mean())  # Record under a distinct name
except Exception as e:
print(f"⚠️ Error in model {name}: {e}") # Catch errors gracefully
print("Operation Completed!")
comparison_frame = pd.DataFrame(
{
"Model": model_names,
"Cross_Val_Score_Train": cross_val_train,
"Train_Accuracy": acc_train,
"Test_Accuracy": acc_test,
"Train_Recall": recall_train,
"Test_Recall": recall_test,
"Train_Precision": precision_train,
"Test_Precision": precision_test,
"Train_F1": f1_train,
"Test_F1": f1_test,
"Train_ROC_AUC": roc_auc_train,
"Test_ROC_AUC": roc_auc_test,
}
)
# Sorting models in decreasing order of test recall
comparison_frame.sort_values(
by=["Test_Recall", "Cross_Val_Score_Train"], ascending=False
).style.highlight_max(color="lightgreen", axis=0).highlight_min(color="pink", axis=0)
Observation:
The best 4 models with respect to validation recall and cross-validation score are: Light GBM, GBM, AdaBoost, and XGBoost, each trained on the over-sampled data.
Undersampling train data using Random Under Sampler Undersampling is another way of dealing with imbalance in the dataset.
Random undersampling involves randomly selecting examples from the majority class and deleting them from the training dataset until a balanced dataset is created.
rus = RandomUnderSampler(random_state=1)
X_train_un, y_train_un = rus.fit_resample(X_train, y_train)
print("Before Under Sampling, counts of label 'Yes': {}".format(sum(y_train == 1)))
print("Before Under Sampling, counts of label 'No': {} \n".format(sum(y_train == 0)))
print("After Under Sampling, counts of label 'Yes': {}".format(sum(y_train_un == 1)))
print("After Under Sampling, counts of label 'No': {} \n".format(sum(y_train_un == 0)))
print("After Under Sampling, the shape of train_X: {}".format(X_train_un.shape))
print("After Under Sampling, the shape of train_y: {} \n".format(y_train_un.shape))
Build Models with Undersampled Data Build and Train Models
models_under = []
# Appending models into the list
models_under.append(("Bagging DownSampling", BaggingClassifier(random_state=seed)))
models_under.append(
("Random forest DownSampling", RandomForestClassifier(random_state=seed))
)
models_under.append(("GBM DownSampling", GradientBoostingClassifier(random_state=seed)))
models_under.append(("Adaboost DownSampling", AdaBoostClassifier(random_state=seed)))
models_under.append(
("Xgboost DownSampling", XGBClassifier(random_state=seed, eval_metric=loss_func))
)
models_under.append(("dtree DownSampling", DecisionTreeClassifier(random_state=seed)))
models_under.append(("Light GBM DownSampling", lgb.LGBMClassifier(random_state=seed)))
for name, model in models_under:
scoring = "recall"
kfold = StratifiedKFold(
n_splits=10, shuffle=True, random_state=1
) # Setting number of splits equal to 10
cv_result_under = cross_val_score(
estimator=model, X=X_train_un, y=y_train_un, scoring=scoring, cv=kfold
)
cv_results.append(cv_result_under)
model.fit(X_train_un, y_train_un)
model_score_under = get_metrics_score(model, X_train_un, X_val, y_train_un, y_val)
add_score_model(name, model_score_under, cv_result_under.mean())
print("Operation Completed!")
comparison_frame = pd.DataFrame(
{
"Model": model_names,
"Cross_Val_Score_Train": cross_val_train,
"Train_Accuracy": acc_train,
"Test_Accuracy": acc_test,
"Train_Recall": recall_train,
"Test_Recall": recall_test,
"Train_Precision": precision_train,
"Test_Precision": precision_test,
"Train_F1": f1_train,
"Test_F1": f1_test,
"Train_ROC_AUC": roc_auc_train,
"Test_ROC_AUC": roc_auc_test,
}
)
# Sorting models in decreasing order of test recall
comparison_frame.sort_values(
by=["Test_Recall", "Cross_Val_Score_Train"], ascending=False
).style.highlight_max(color="lightgreen", axis=0).highlight_min(color="pink", axis=0)
Observation:
The 4 best models are: XGBoost, AdaBoost, Light GBM, and GBM, each trained on the under-sampled data.
%%time
# defining model
model = XGBClassifier(random_state=seed, eval_metric=loss_func)
# Parameter grid to pass in RandomizedSearchCV
param_grid={'n_estimators':np.arange(50,500,50),
'scale_pos_weight':[2,5,10],
'learning_rate':[0.01,0.1,0.2,0.05],
'gamma':[0,1,3,5],
'subsample':[0.8,0.9,1],
'max_depth':np.arange(4,20,1),
'reg_lambda':[5,10, 15, 20]}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
#Calling RandomizedSearchCV
xgb_tuned = RandomizedSearchCV(estimator=model, param_distributions=param_grid, n_iter=50, scoring=scorer, cv=10, random_state=seed, n_jobs = -1)
#Fitting parameters in RandomizedSearchCV
xgb_tuned.fit(X_train_un,y_train_un)
print("Best parameters are {} with CV score: {}".format(xgb_tuned.best_params_, xgb_tuned.best_score_))
# building model with best parameters
xgb_tuned_model = XGBClassifier(
n_estimators=150,
scale_pos_weight=10,
subsample=1,
reg_lambda=20,
max_depth=5,
learning_rate=0.01,
gamma=0,
eval_metric=loss_func,
random_state=seed,
)
# Fit the model on training data
xgb_tuned_model.fit(X_train_un, y_train_un)
xgb_tuned_model_score = get_metrics_score(
xgb_tuned_model, X_train, X_val, y_train, y_val
)
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
scoring = "recall"
xgb_down_cv = cross_val_score(
estimator=xgb_tuned_model, X=X_train_un, y=y_train_un, scoring=scoring, cv=kfold
)
add_score_model(
"XGB Tuned with Down Sampling", xgb_tuned_model_score, xgb_down_cv.mean()
)
make_confusion_matrix(xgb_tuned_model, X_val, y_val)
%%time
# defining model
model = AdaBoostClassifier(random_state=seed)
# Parameter grid to pass in RandomizedSearchCV
param_grid={'n_estimators':np.arange(50,2000,50),
'learning_rate':[0.01,0.1,0.2,0.05]}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
#Calling RandomizedSearchCV
ada_tuned = RandomizedSearchCV(estimator=model, param_distributions=param_grid, n_iter=50, scoring=scorer, cv=10, random_state=seed, n_jobs = -1)
#Fitting parameters in RandomizedSearchCV
ada_tuned.fit(X_train_un,y_train_un)
print("Best parameters are {} with CV score: {}".format(ada_tuned.best_params_, ada_tuned.best_score_))
# building model with best parameters
ada_tuned_model = AdaBoostClassifier(
n_estimators=1050, learning_rate=0.1, random_state=seed
)
# Fit the model on training data
ada_tuned_model.fit(X_train_un, y_train_un)
ada_tuned_model_score = get_metrics_score(
ada_tuned_model, X_train, X_val, y_train, y_val
)
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
scoring = "recall"
ada_down_cv = cross_val_score(
estimator=ada_tuned_model, X=X_train_un, y=y_train_un, scoring=scoring, cv=kfold
)
add_score_model(
"AdaBoost Tuned with Down Sampling", ada_tuned_model_score, ada_down_cv.mean()
)
Confusion matrix on validation
make_confusion_matrix(ada_tuned_model, X_val, y_val)
Tuning Light GBM with Down-Sampled data
%%time
# defining model
model = lgb.LGBMClassifier(random_state=seed)
# Hyper parameters
min_gain_to_split = [0.01, 0.1, 0.2, 0.3]
min_data_in_leaf = [10, 20, 30, 40, 50]
feature_fraction = [0.8, 0.9, 1.0]
max_depth = [5, 8, 15, 25, 30]
extra_trees = [True, False]
learning_rate = [0.01,0.1,0.2,0.05]
# Parameter grid to pass in RandomizedSearchCV
param_grid={'min_gain_to_split': min_gain_to_split,
'min_data_in_leaf': min_data_in_leaf,
'feature_fraction': feature_fraction,
'max_depth': max_depth,
'extra_trees': extra_trees,
'learning_rate': learning_rate,
'boosting_type': ['gbdt'],
'objective': ['binary'],
'is_unbalance': [True],
'metric': ['binary_logloss'],}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
#Calling RandomizedSearchCV
lgbm_tuned = RandomizedSearchCV(estimator=model, param_distributions=param_grid, n_iter=50, scoring=scorer, cv=10, random_state=seed, n_jobs = -1)
#Fitting parameters in RandomizedSearchCV
lgbm_tuned.fit(X_train_un,y_train_un)
print("Best parameters are {} with CV score: {}".format(lgbm_tuned.best_params_, lgbm_tuned.best_score_))
Building the model with the resulting best parameters
lgbm_tuned_model = lgb.LGBMClassifier(
min_gain_to_split = 0.01,
min_data_in_leaf = 50,
feature_fraction = 0.8,
max_depth = 8,
extra_trees = False,
learning_rate = 0.2,
objective = 'binary',
metric = 'binary_logloss',
is_unbalance = True,
boosting_type = 'gbdt',
random_state = seed
)
# Fit the model on training data
lgbm_tuned_model.fit(X_train_un, y_train_un)
lgbm_tuned_model_score = get_metrics_score(
lgbm_tuned_model, X_train, X_val, y_train, y_val
)
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
scoring = "recall"
lgb_down_cv = cross_val_score(
estimator=lgbm_tuned_model, X=X_train_un, y=y_train_un, scoring=scoring, cv=kfold
)
add_score_model(
"Light GBM Tuned with Down Sampling", lgbm_tuned_model_score, lgb_down_cv.mean()
)
make_confusion_matrix(lgbm_tuned_model, X_val, y_val)
Tuning GBM with Down Sampled data
%%time
# defining model
model = GradientBoostingClassifier(random_state=seed)
# Hyperparameters for the gradient boosting classifier
n_estimators = [int(x) for x in np.linspace(start=50, stop=2000, num=10)]
max_features = ["sqrt", "log2"]  # "auto" is no longer a valid option in recent scikit-learn
max_depth = [5, 8, 15, 25, 30]
min_samples_split = [2, 5, 10, 15, 100]
min_samples_leaf = [1, 2, 5, 10, 15]
# Parameter grid to pass in RandomizedSearchCV
param_grid={'n_estimators': n_estimators,
'max_features': max_features,
'max_depth': max_depth,
'min_samples_split': min_samples_split,
'min_samples_leaf': min_samples_leaf}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
#Calling RandomizedSearchCV
gbm_tuned = RandomizedSearchCV(estimator=model, param_distributions=param_grid, n_iter=50, scoring=scorer, cv=10, random_state=seed, n_jobs = -1)
#Fitting parameters in RandomizedSearchCV
gbm_tuned.fit(X_train_un,y_train_un)
print("Best parameters are {} with CV score: {}".format(gbm_tuned.best_params_, gbm_tuned.best_score_))
gbm_tuned_model = GradientBoostingClassifier(
n_estimators=700,
max_features="sqrt", # Change "auto" to "sqrt" or another valid option
max_depth=25,
min_samples_split=2,
min_samples_leaf=15,
random_state=seed,
)
# Fit the model on training data
gbm_tuned_model.fit(X_train_un, y_train_un)
gbm_tuned_model_score = get_metrics_score(
gbm_tuned_model, X_train, X_val, y_train, y_val
)
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
scoring = "recall"
gbm_down_cv = cross_val_score(
estimator=gbm_tuned_model, X=X_train_un, y=y_train_un, scoring=scoring, cv=kfold
)
add_score_model(
"GBM Tuned with Down Sampling", gbm_tuned_model_score, gbm_down_cv.mean()
)
comparison_frame = pd.DataFrame(
{
"Model": model_names,
"Cross_Val_Score_Train": cross_val_train,
"Train_Accuracy": acc_train,
"Test_Accuracy": acc_test,
"Train_Recall": recall_train,
"Test_Recall": recall_test,
"Train_Precision": precision_train,
"Test_Precision": precision_test,
"Train_F1": f1_train,
"Test_F1": f1_test,
"Train_ROC_AUC": roc_auc_train,
"Test_ROC_AUC": roc_auc_test,
}
)
for col in comparison_frame.select_dtypes(include="float64").columns.tolist():
    comparison_frame[col] = round(comparison_frame[col] * 100, 0).astype(int)
comparison_frame.tail(4).sort_values(
by=["Cross_Val_Score_Train", "Test_Recall"], ascending=False
)
feature_names = X_train.columns
importances = gbm_tuned_model.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
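Impurity-based importances from tree ensembles can be biased toward high-cardinality features. As a cross-check, permutation importance measures the drop in the chosen metric when each feature is shuffled. A sketch on toy data (the model and arrays here are stand-ins for the notebook's `gbm_tuned_model`, `X_val`, `y_val`):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance

# Toy stand-ins for the notebook's fitted model and validation data
X, y = make_classification(n_samples=300, n_features=6, random_state=1)
model = GradientBoostingClassifier(random_state=1).fit(X, y)

# Permutation importance: mean drop in recall when each feature is shuffled
result = permutation_importance(model, X, y, scoring="recall", n_repeats=5, random_state=1)
ranked = result.importances_mean.argsort()[::-1]
print([int(i) for i in ranked])  # feature indices ordered by importance
```

Because it is computed via the scoring function, it can use the same `recall` metric the project optimizes, unlike `feature_importances_`.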
print(X_train.dtypes)
print(X_train.head())
print(X_train.select_dtypes(include=['object']).head())
from sklearn.preprocessing import LabelEncoder
# Get the actual categorical column names
categorical_cols = X_train.select_dtypes(include=['category', 'object']).columns.tolist() # Include both 'category' and 'object' types
# Apply Label Encoding to each categorical column
for col in categorical_cols:
    le = LabelEncoder()  # new encoder per column
    X_train[col] = le.fit_transform(X_train[col])
    # Note: transform() raises on labels that were not seen during fit
    X_test[col] = le.transform(X_test[col])
print(X_train.columns) # Check available columns
print(X_train.dtypes)
final_acc_test = 0.0  # default; overwritten once the final model is scored
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
column_name = "actual_column_name_here" # Replace with the correct column name
if column_name in X_train.columns:
    X_train[column_name] = le.fit_transform(X_train[column_name])
    X_test[column_name] = le.transform(X_test[column_name])
else:
    print(f"Column '{column_name}' not found in X_train")
print([col for col in X_train.columns if "your_column_name" in col])
# Defaults for the final score variables; overwritten once the final model is scored
final_recall_train = 0.0
final_recall_test = 0.0
final_precision_train = 0.0
final_precision_test = 0.0
final_f1_train = 0.0
final_f1_test = 0.0
final_roc_auc_train = 0.0
final_roc_auc_test = 0.0
print(X_train.info()) # Check data types
print(X_train.head()) # Inspect first few rows
print(X_train.select_dtypes(include=['object']).head()) # Show non-numeric columns
X_train = pd.get_dummies(X_train, drop_first=True)
X_test = pd.get_dummies(X_test, drop_first=True)
# Ensure both datasets have the same columns
X_train, X_test = X_train.align(X_test, join='left', axis=1, fill_value=0)
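Running `get_dummies` on train and test independently can produce different column sets whenever a category is absent from one split; `align` with `fill_value=0` pads the missing dummy columns with zeros. A minimal illustration on hypothetical toy frames:

```python
import pandas as pd

# 'Gold' appears only in the training split
train = pd.get_dummies(pd.DataFrame({"card": ["Blue", "Gold", "Blue"]}), drop_first=True)
test = pd.get_dummies(pd.DataFrame({"card": ["Blue", "Blue"]}), drop_first=True)
print(train.columns.tolist())  # ['card_Gold']
print(test.columns.tolist())   # [] -- 'Gold' never appears in test

# Left-align test to train's columns, filling the missing dummies with 0
train_a, test_a = train.align(test, join="left", axis=1, fill_value=0)
print(test_a.columns.tolist())  # ['card_Gold']
```

Without this step, the model would see a different feature matrix shape at prediction time and fail.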
gbm_tuned_model_test_score = get_metrics_score(
gbm_tuned_model, X_train, X_test, y_train, y_test
)
final_model_names = ["GBM Tuned with Down Sampling"]
final_acc_train = [gbm_tuned_model_test_score[0]]
final_acc_test = [gbm_tuned_model_test_score[1]]
final_recall_train = [gbm_tuned_model_test_score[2]]
final_recall_test = [gbm_tuned_model_test_score[3]]
final_precision_train = [gbm_tuned_model_test_score[4]]
final_precision_test = [gbm_tuned_model_test_score[5]]
final_f1_train = [gbm_tuned_model_test_score[6]]
final_f1_test = [gbm_tuned_model_test_score[7]]
final_roc_auc_train = [gbm_tuned_model_test_score[8]]
final_roc_auc_test = [gbm_tuned_model_test_score[9]]
final_result_score = pd.DataFrame(
{
"Model": final_model_names,
"Train_Accuracy": final_acc_train,
"Test_Accuracy": final_acc_test,
"Train_Recall": final_recall_train,
"Test_Recall": final_recall_test,
"Train_Precision": final_precision_train,
"Test_Precision": final_precision_test,
"Train_F1": final_f1_train,
"Test_F1": final_f1_test,
"Train_ROC_AUC": final_roc_auc_train,
"Test_ROC_AUC": final_roc_auc_test,
}
)
for col in final_result_score.select_dtypes(include="float64").columns.tolist():
    final_result_score[col] = final_result_score[col] * 100
final_result_score
make_confusion_matrix(gbm_tuned_model, X_test, y_test)
Gain Chart
# scikit-plot is unmaintained; pin scipy to a version compatible with it
!pip install scipy==1.11.4 scikit-plot --upgrade
from numpy import interp
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
# Identify categorical features from the training data (assumed 'object'/'category' dtypes)
categorical_features = X_train.select_dtypes(include=['object', 'category']).columns.tolist()
# Apply OneHotEncoder to the categorical features; pass numerical features through
preprocessor = ColumnTransformer(
    transformers=[
        ('num', 'passthrough', X_train.select_dtypes(exclude=['object', 'category']).columns.tolist()),
        ('cat', OneHotEncoder(sparse_output=False, handle_unknown='ignore'), categorical_features),
    ]
)
# Fit and transform the preprocessor on your training data (X_train)
# This ensures that the same encoding is applied to both training and test data
X_train_encoded = preprocessor.fit_transform(X_train)
# Transform the test data (X_test) using the fitted preprocessor
X_test_encoded = preprocessor.transform(X_test)
# Refit the model on the encoded training data, then predict on the encoded test data
gbm_tuned_model.fit(X_train_encoded, y_train)
y_pred_prob = gbm_tuned_model.predict_proba(X_test_encoded)
# Continue with the rest of your code...
from sklearn.metrics import RocCurveDisplay
# Assuming 'gbm_tuned_model', 'X_test', and 'y_test' are defined
RocCurveDisplay.from_estimator(gbm_tuned_model, X_test, y_test)
plt.title("Receiver Operating Characteristic")
plt.legend(loc="lower right")
plt.plot([0, 1], [0, 1], "b--")
plt.xlim([-0.05, 1])
plt.ylim([0, 1.05])
plt.ylabel("True Positive Rate")
plt.xlabel("False Positive Rate")
plt.show()
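Since recall on attrited customers is the metric this project optimizes, the default 0.5 probability threshold is not sacred: the ROC/PR curves can be used to pick a lower threshold that guarantees a target recall. A sketch on toy data (the classifier and split are stand-ins, not the notebook's tuned model):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# Toy imbalanced problem standing in for the churn data
X, y = make_classification(n_samples=500, weights=[0.8], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)
clf = GradientBoostingClassifier(random_state=1).fit(X_tr, y_tr)

prec, rec, thr = precision_recall_curve(y_te, clf.predict_proba(X_te)[:, 1])
# Highest threshold that still achieves at least 90% recall on this split
idx = (rec[:-1] >= 0.90).nonzero()[0][-1]
print(f"threshold={thr[idx]:.2f} precision={prec[idx]:.2f} recall={rec[idx]:.2f}")
```

The bank can then trade a little precision (more retention offers sent to non-churners) for catching more true churners.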
seed = 1
loss_func = "logloss"
# Test and Validation sizes
test_size = 0.2
val_size = 0.25
# Dependent variable value map
target_mapper = {"Attrited Customer": 1, "Existing Customer": 0}
df_pipe = data.copy()
cat_columns = df_pipe.select_dtypes(include="object").columns.tolist()
df_pipe[cat_columns] = df_pipe[cat_columns].astype("category")
X = df_pipe.drop(columns=["attrition_flag"])
y = df_pipe["attrition_flag"].map(target_mapper)
X_temp, X_test, y_temp, y_test = train_test_split(
X, y, test_size=test_size, random_state=seed, stratify=y
)
# then we split the temporary set into train and validation
X_train, X_val, y_train, y_val = train_test_split(
X_temp, y_temp, test_size=val_size, random_state=seed, stratify=y_temp
)
print(X_train.shape, X_val.shape, X_test.shape)
print(y_train.value_counts(normalize=True))
print(y_val.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))
under_sample = RandomUnderSampler(random_state=seed)
X_train_un, y_train_un = under_sample.fit_resample(X_train, y_train)
columns_to_drop = [
"clientnum",
"credit_limit",
"dependent_count",
"months_on_book",
"avg_open_to_buy",
"customer_age",
]
# For masking a particular value in a feature
column_to_mask_value = "income_category"
value_to_mask = "abc"
masked_value = "Unknown"
# One-hot encoding columns
columns_to_encode = [
"gender",
"education_level",
"marital_status",
"income_category",
"card_category",
]
# Numerical Columns
num_columns = [
"total_relationship_count",
"months_inactive_12_mon",
"contacts_count_12_mon",
"total_revolving_bal",
"total_amt_chng_q4_q1",
"total_trans_amt",
"total_trans_ct",
"total_ct_chng_q4_q1",
"avg_utilization_ratio",
]
columns_to_null_imp_unknown = ["education_level", "marital_status"]
from sklearn.base import BaseEstimator, TransformerMixin

class FillUnknown(BaseEstimator, TransformerMixin):
    """Fill missing values in categorical columns with an explicit 'Unknown' category."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X_ = X.copy()
        for col in X_.columns:
            if X_[col].dtype.name == "category":
                # Add 'Unknown' as a category only if it is not already present
                if "Unknown" not in X_[col].cat.categories:
                    X_[col] = X_[col].cat.add_categories("Unknown")
                X_[col] = X_[col].fillna("Unknown")
        return X_
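A quick sanity check of the `FillUnknown` transformer on a toy frame (the class is repeated inside the cell so it runs standalone; in the notebook the definition above is reused):

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class FillUnknown(BaseEstimator, TransformerMixin):
    """Fill NaN in categorical columns with an explicit 'Unknown' category."""
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        X_ = X.copy()
        for col in X_.columns:
            if X_[col].dtype.name == "category":
                if "Unknown" not in X_[col].cat.categories:
                    X_[col] = X_[col].cat.add_categories("Unknown")
                X_[col] = X_[col].fillna("Unknown")
        return X_

# NaN in a categorical column becomes the 'Unknown' category
df = pd.DataFrame({"marital_status": pd.Categorical(["Married", None, "Single"])})
out = FillUnknown().fit_transform(df)
print(out["marital_status"].tolist())  # ['Married', 'Unknown', 'Single']
```

The category check matters: calling `add_categories('Unknown')` twice on the same column raises, which is exactly the failure mode a pipeline refit would hit.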
# 'marital_status' is intentionally kept (it is one-hot encoded below)
columns_to_drop = [
    "clientnum",
    "credit_limit",
    "dependent_count",
    "months_on_book",
    "avg_open_to_buy",
    "customer_age",
]
feature_name_standardizer = FeatureNamesStandardizer()
# To Drop unnecessary columns
column_dropper = ColumnDropper(features=columns_to_drop)
# To Mask incorrect/meaningless value of a feature
value_masker = CustomValueMasker(
feature=column_to_mask_value, value_to_mask=value_to_mask, masked_value=masked_value
)
# Missing value imputation
imputer = FillUnknown()
# To encode the categorical data
one_hot = OneHotEncoder(handle_unknown="ignore")
columns_to_encode = [
"gender",
"education_level",
"marital_status", # This column should be present for encoding
"income_category",
"card_category",
]
scaler = RobustScaler()
# creating a transformer for feature name standardization and dropping columns
cleanser = Pipeline(
steps=[
("feature_name_standardizer", feature_name_standardizer),
("column_dropper", column_dropper),
("value_mask", value_masker),
("imputation", imputer),
]
)
# creating a transformer for data encoding
encode_transformer = Pipeline(steps=[("onehot", one_hot)])
num_scaler = Pipeline(steps=[("scale", scaler)])
preprocessor = ColumnTransformer(
transformers=[
("encoding", encode_transformer, columns_to_encode),
("scaling", num_scaler, num_columns),
],
remainder="passthrough",
)
# Model
gbm_tuned_model = GradientBoostingClassifier(
n_estimators=700,
max_features="sqrt",  # "auto" is not a valid option in recent scikit-learn
max_depth=25,
min_samples_split=2,
min_samples_leaf=15,
random_state=seed,
)
# 'customer_age' is no longer dropped here; it is scaled with the numerical columns below
columns_to_drop = [
    "clientnum",
    "credit_limit",
    "dependent_count",
    "months_on_book",
    "avg_open_to_buy",
]
num_columns = [
"total_relationship_count",
"months_inactive_12_mon",
"contacts_count_12_mon",
"total_revolving_bal",
"total_amt_chng_q4_q1",
"total_trans_amt",
"total_trans_ct",
"total_ct_chng_q4_q1",
"avg_utilization_ratio",
"customer_age", # Include 'customer_age' explicitly in num_columns
]
columns_to_drop = [
    "clientnum",
    "credit_limit",
    "dependent_count",
    "months_on_book",
    "avg_open_to_buy",
    "customer_age",
]
gbm_tuned_model = GradientBoostingClassifier(
n_estimators=700,
max_features="sqrt",  # "auto" is not a valid option in recent scikit-learn
max_depth=25,
min_samples_split=2,
min_samples_leaf=15,
random_state=seed,
)
# Creating new pipeline with best parameters
model_pipe = Pipeline(
steps=[
("cleanse", cleanser),
("preprocess", preprocessor),
("model", gbm_tuned_model),
]
)
# Fit the model on training data
model_pipe.fit(X_train_un, y_train_un)
# Fallback: recreate the training split if it was lost (note: this is a plain 80/20 split, not under-sampled)
X_train_un, X_test_un, y_train_un, y_test_un = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train_un.shape)  # confirm it contains data
# Replace "Unknown" with NaN (so it gets handled by the imputer)
X_train_un = X_train_un.replace("Unknown", np.nan)
X_test_un = X_test_un.replace("Unknown", np.nan)
one_hot = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
print(X_train_un.select_dtypes(include="category").apply(lambda x: x.cat.categories))
print("Columns in X_train_un:", X_train_un.columns)
print("Columns sent for encoding:", columns_to_encode)
if "marital_status" not in columns_to_encode:
    print("⚠️ Warning: 'marital_status' is missing from encoding step!")
X_train_un_transformed = feature_name_standardizer.transform(X_train_un)
print("Transformed column names:", X_train_un_transformed.columns)
print("Columns in X_train_un:", X_train_un.columns)
print("Columns before column dropper:", X_train_un.columns)
X_train_un = column_dropper.transform(X_train_un)
print("Columns after column dropper:", X_train_un.columns)
X_transformed = feature_name_standardizer.transform(X_train_un)
print("Columns after feature name standardization:", X_transformed.columns)
print("Columns sent for encoding:", columns_to_encode)
# Ensure 'marital_status' is not dropped and is correctly named
# 1. Check the `columns_to_drop` list and remove 'marital_status' if present.
columns_to_drop = [
    "clientnum",
    "credit_limit",
    "dependent_count",
    "months_on_book",
    "avg_open_to_buy",
    "customer_age",
]
# 2. Check for any renaming of 'marital_status' during data processing and adjust the pipeline.
# If you've used FeatureNamesStandardizer, it's likely renamed to 'marital_status'
# Check your ColumnTransformer definition for correct column names:
# Modify `num_columns` and `cat_columns` to reflect correct names:
num_columns = [
"total_relationship_count",
"months_inactive_12_mon",
"contacts_count_12_mon",
"total_revolving_bal",
"total_amt_chng_q4_q1",
"total_trans_amt",
"total_trans_ct",
"total_ct_chng_q4_q1",
]
cat_columns = [
"gender",
"education_level",
"income_category",
"card_category",
"marital_status",  # lower-case to match FeatureNamesStandardizer output
]
print("Columns in X_train_un:", X_train_un.columns.tolist())
X_transformed = feature_name_standardizer.transform(X_train_un)
print("Columns after feature name standardization:", X_transformed.columns.tolist())
print("Columns to encode:", columns_to_encode)
print("Original dataset columns:", df.columns.tolist()) # Your original DataFrame
print("Columns in X_train_un:", X_train_un.columns.tolist())
print("Original dataset columns:", df.columns.tolist())
print("Columns in data:", data.columns.tolist())
print([col for col in data.columns if "marital" in col.lower()])
print(data.columns.tolist())
data.columns = data.columns.str.strip().str.replace(" ", "_").str.lower()
print(data.columns.tolist()) # Print cleaned column names
print(data.head())
# Ensure 'marital_status' is NOT in 'columns_to_drop'
columns_to_drop = [
"clientnum",
"credit_limit",
"dependent_count",
"months_on_book",
"avg_open_to_buy",
"customer_age",
]
model_pipe.fit(X_train_un, y_train_un)
# Fallback: recreate 'marital_status' if it was dropped upstream (every value becomes 'Unknown')
data["marital_status"] = "Unknown"
X_train_un["marital_status"] = data["marital_status"]
print(columns_to_encode)
columns_to_encode.append("marital_status")
X_train_un["marital_status"] = X_train_un["marital_status"].astype(str).fillna("Unknown")
# Valid max_features options: "sqrt", "log2", a float fraction (e.g. 0.5),
# an int count (e.g. 5), or None (use all features)
gbm_tuned_model = GradientBoostingClassifier(
    n_estimators=700,
    max_features=None,
    max_depth=25,
    min_samples_split=2,
    min_samples_leaf=15,
    random_state=42,
)
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler # Import StandardScaler
from sklearn.ensemble import GradientBoostingClassifier
# ... (other imports) ...
# Assuming categorical_features is a list of your categorical column names
categorical_features = X_train_un.select_dtypes(include=['object', 'category']).columns.tolist()
numerical_features = X_train_un.select_dtypes(exclude=['object', 'category']).columns.tolist()
# Create transformers for numerical and categorical features
# Add StandardScaler to numerical_transformer
numerical_transformer = Pipeline(steps=[('scaler', StandardScaler())])
categorical_transformer = Pipeline(
steps=[
("onehot", OneHotEncoder(sparse_output=False, handle_unknown='ignore')) # One-hot encode categorical features
]
)
# Combine transformers using ColumnTransformer
preprocessor = ColumnTransformer(
transformers=[
("num", numerical_transformer, numerical_features),
("cat", categorical_transformer, categorical_features),
]
)
# Create the final pipeline with the preprocessor and the model
model_pipe = Pipeline(
steps=[
("preprocessor", preprocessor), # Apply preprocessing
(
"gbm",
GradientBoostingClassifier(
n_estimators=700,
max_features=None,
max_depth=25,
min_samples_split=2,
min_samples_leaf=15,
random_state=42,
),
),
]
)
# ... (rest of your code) ...
model_pipe.fit(X_train_un, y_train_un)
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import SimpleImputer
# ... (other imports) ...
# Assuming categorical_features is a list of your categorical column names
categorical_features = X_train_un.select_dtypes(include=['object', 'category']).columns.tolist()
numerical_features = X_train_un.select_dtypes(exclude=['object', 'category']).columns.tolist()
# Create transformers for numerical and categorical features
# Add StandardScaler to numerical_transformer
numerical_transformer = Pipeline(steps=[('scaler', StandardScaler())])
categorical_transformer = Pipeline(
steps=[
("imputer", SimpleImputer(strategy='most_frequent')), # Impute missing values before OneHotEncoding
("onehot", OneHotEncoder(sparse_output=False, handle_unknown='ignore')) # One-hot encode categorical features
]
)
# Combine transformers using ColumnTransformer
preprocessor = ColumnTransformer(
transformers=[
("num", numerical_transformer, numerical_features),
("cat", categorical_transformer, categorical_features),
])
print(X_test.shape)
if X_test.shape[0] == 0:
    print("Error: X_test is empty!")
    # Stopgap: reuse a few training rows so downstream cells run.
    # Note this is NOT a valid evaluation set.
    X_test = X_train_un[:10]
    y_test = y_train_un[:10]
print(f"X_train_un shape: {X_train_un.shape}")
print(f"y_train_un shape: {y_train_un.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_test shape: {y_test.shape}")
print(set(X_train_un.columns) - set(X_test.columns)) # Columns in train but missing in test
print(set(X_test.columns) - set(X_train_un.columns)) # Columns in test but missing in train
print(f"X_test shape: {X_test.shape}")
print(f"y_test shape: {y_test.shape}")
correct_column_name = "attrition_flag" # Replace with the actual column name you want to drop
X_train_un = data.drop(columns=[correct_column_name])
target_col = [col for col in data.columns if "target" in col.lower()]
print(target_col) # Check which column(s) match
if target_col:
    X_train_un = data.drop(columns=target_col[0])  # use the found column
else:
    print("⚠️ No column found with 'target' in the name!")
print(type(data)) # Should be <class 'pandas.DataFrame'>
print(data.shape) # Check number of rows/columns
if "target_column" in data.columns:
    X_train_un = data.drop(columns=["target_column"])
else:
    print("⚠️ 'target_column' not found! Available columns:", data.columns)
data.rename(columns={"wrong_column_name": "target_column"}, inplace=True)
print(data.columns) # Print the column names
X_train_un = data.drop(columns=["attrition_flag"])
from sklearn.preprocessing import OneHotEncoder
# ... other imports ...
# ... your pipeline definition ...
categorical_features = ['gender', 'education_level', 'marital_status', 'income_category', 'card_category']
categorical_transformer = Pipeline(
steps=[
("onehot", OneHotEncoder(sparse_output=False, handle_unknown='ignore')) # handle_unknown='ignore' added
]
)
# Define the column transformer ('robust_scaler' was not defined earlier; create it here)
robust_scaler = RobustScaler()
preprocessor = ColumnTransformer(
transformers=[
("num", robust_scaler, num_columns),
("cat", categorical_transformer, categorical_features),
]
)
# Define the pipeline
model_pipe = Pipeline(
steps=[("preprocessor", preprocessor), ("classifier", LogisticRegression())]
)
# ... rest of your code ...
categorical_features = ['gender', 'education_level', 'income_category', 'card_category']
# Find the actual column name if it exists with different casing
for col in X_train.columns:
    if col.lower().strip() == 'marital_status':
        categorical_features.insert(2, col)  # insert at the desired position
        break  # exit the loop once found
# If the column is still not found, warn and proceed without it
if 'marital_status' not in [c.lower().strip() for c in categorical_features]:
    print("Warning: 'marital_status' column not found. Proceeding without it.")
categorical_transformer = Pipeline(
steps=[
("onehot", OneHotEncoder(sparse_output=False, handle_unknown='ignore'))
]
)
preprocessor = ColumnTransformer(
transformers=[
("num", robust_scaler, num_columns),
("cat", categorical_transformer, categorical_features),
]
)
model_pipe = Pipeline(
steps=[("preprocessor", preprocessor), ("classifier", LogisticRegression())]
)
# Fit the pipeline to the training data
model_pipe.fit(X_train, y_train)
# Now you can score on the test data
print(
"Accuracy on Test is: {}%".format(round(model_pipe.score(X_test, y_test) * 100, 0))
)
# Class predictions at the default 0.5 probability threshold (already boolean; no rounding needed)
pred_train_p = model_pipe.predict_proba(X_train)[:, 1] > 0.5
pred_test_p = model_pipe.predict_proba(X_test)[:, 1] > 0.5
train_acc_p = accuracy_score(y_train, pred_train_p) # Use y_train instead of y_train_un
test_acc_p = accuracy_score(y_test, pred_test_p)
train_recall_p = recall_score(y_train, pred_train_p) # Use y_train instead of y_train_un
test_recall_p = recall_score(y_test, pred_test_p)
print("Recall on Test is: {}%".format(round(test_recall_p * 100, 0)))
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd # Make sure pandas is imported
# ... (Your previous code) ...
# Assuming 'data' is your original DataFrame, create 'data_clean'
# Replace this with the appropriate operations to clean your data
data_clean = data.copy() # Example: Create a copy of 'data'
# ... (Rest of your code to generate the heatmap) ...
# Select only numerical features for correlation calculation
numerical_data = data_clean.select_dtypes(include=np.number)
mask = np.zeros_like(numerical_data.corr(), dtype=bool) # Use numerical_data.corr()
mask[np.triu_indices_from(mask)] = True
sns.set(rc={"figure.figsize": (15, 15)})
sns.heatmap(
    numerical_data.corr(),  # Use numerical_data.corr()
    cmap=sns.diverging_palette(20, 220, n=200),
    annot=True,
    mask=mask,
    center=0,
)
plt.show()
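With many numerical columns the annotated heatmap gets crowded, so it can help to also list the strongest pairs directly. A sketch of that idea on a toy frame (the column names and data are invented stand-ins for `data_clean.select_dtypes(include=np.number)`):

```python
import numpy as np
import pandas as pd

# Toy numerical frame; credit_limit and avg_open_to_buy are built to correlate
rng = np.random.default_rng(0)
base = rng.normal(size=100)
numerical_data = pd.DataFrame({
    "credit_limit": base,
    "avg_open_to_buy": base * 0.9 + rng.normal(scale=0.1, size=100),
    "customer_age": rng.normal(size=100),
})

# Keep the upper triangle (each pair once), flatten, and rank by |correlation|
corr = numerical_data.corr()
pairs = (
    corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    .stack()
    .abs()
    .sort_values(ascending=False)
)
print(pairs.head())
```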
# Assuming X_test and X_train have the same columns but some are categorical
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import GradientBoostingClassifier  # needed for the classifier below
# Identify categorical features (e.g., 'gender', 'education_level')
# 'marital_status' has been removed from the list as it is causing the error
categorical_features = ['gender', 'education_level', 'income_category', 'card_category']
# Create a ColumnTransformer to apply OneHotEncoder to categorical features
preprocessor = ColumnTransformer(
    transformers=[
        ('num', 'passthrough', [col for col in X_train.columns if col not in categorical_features]),  # Passthrough for numerical
        ('cat', OneHotEncoder(sparse_output=False, handle_unknown='ignore'), categorical_features),  # One-hot encode categorical
    ]
)
# Create a pipeline with the preprocessor and your model
model_pipe = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', GradientBoostingClassifier(random_state=42)),  # Replace with your desired model
])
# Fit the pipeline to your training data
model_pipe.fit(X_train, y_train)
# Now you can make predictions on your test data
y_pred_gb = model_pipe.predict_proba(X_test)[:, 1]
# ... (rest of your code for other models and y_pred_all calculation) ...
from sklearn.metrics import average_precision_score, roc_auc_score
# Assuming 'model_pipe' is your trained model and 'X_test' is your test data
y_pred_all = model_pipe.predict_proba(X_test)[:, 1]  # Get predicted probabilities for class 1
# Now you can use the functions (printed so the values show outside a notebook cell)
print(average_precision_score(y_test, y_pred_all), roc_auc_score(y_test, y_pred_all))
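Both metrics above work on probabilities rather than hard labels, and each has a different no-skill baseline: a random scorer's ROC-AUC is about 0.5, while its average precision is about the positive-class base rate. A self-contained sketch with hypothetical stand-ins for `y_test` and `y_pred_all`:

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

# Hypothetical labels and scores; here the scores rank all positives on top
y_true = np.array([0, 0, 0, 0, 1, 1, 0, 1, 0, 0])
y_score = np.array([0.1, 0.2, 0.15, 0.3, 0.8, 0.7, 0.4, 0.9, 0.25, 0.05])

# Compare both metrics against their baselines: base rate for AP, 0.5 for AUC
print("base rate:", y_true.mean())
print("AP:", average_precision_score(y_true, y_score))
print("ROC-AUC:", roc_auc_score(y_true, y_score))
```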
Insights and Recommendations:
Insights:
1. Looking at attrited customers by gender, 14.4% more males than females have left their credit card services.
Recommendations:
1. The bank should build better coordination and a stronger relationship with its customers.